feat(core): add opt-in periodic ping for connection health monitoring#1717
feat(core): add opt-in periodic ping for connection health monitoring#1717travisbreaks wants to merge 4 commits intomodelcontextprotocol:mainfrom
Conversation
🦋 Changeset detectedLatest commit: 5116b16 The changes in this PR will be included in the next version bump. This PR includes changesets to release 7 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
@modelcontextprotocol/client
@modelcontextprotocol/server
@modelcontextprotocol/express
@modelcontextprotocol/fastify
@modelcontextprotocol/hono
@modelcontextprotocol/node
commit: |
51a9eae to
b41b17e
Compare
|
@claude review |
Implements the missing periodic ping functionality specified in issue modelcontextprotocol#1000. Per the MCP specification, implementations SHOULD periodically issue pings to detect connection health, with configurable frequency. Changes: - Add `pingIntervalMs` option to `ProtocolOptions` (disabled by default) - Implement `startPeriodicPing()` and `stopPeriodicPing()` in Protocol - Client starts periodic ping after successful initialization - Server starts periodic ping after receiving initialized notification - Timer uses `unref()` so it does not prevent clean process exit - Ping failures are reported via `onerror` without stopping the timer - Timer is automatically cleaned up on close or unexpected disconnect Fixes modelcontextprotocol#1000 Co-authored-by: Br1an67 <932039080@qq.com>
b41b17e to
9e1b5da
Compare
Mirrors the Go SDK pattern where the initialized notification handler is a named method rather than an inline closure.
|
@claude review |
The reconnection path (when transport.sessionId is set) returns early, skipping the startPeriodicPing() call that runs after normal initialization. Since _onclose() clears the ping timer during disconnection, connection health monitoring silently stops working after any reconnect. Adding startPeriodicPing() before the early return ensures pings resume on reconnection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@claude review |
| async close(): Promise<void> { | ||
| this.stopPeriodicPing(); | ||
| await this._transport?.close(); | ||
| } |
There was a problem hiding this comment.
🔴 When close() is called while a periodic ping is in-flight, _onclose() rejects all pending response handlers with SdkError(ConnectionClosed), causing the ping catch block to fire this._onerror() with that error. Users monitoring onerror for connection health will receive spurious Connection closed errors during normal, intentional shutdowns that are indistinguishable from genuine connection failures. Fix: add a _closing boolean flag set before closing the transport, and check it in the ping catch block to suppress the error.
Extended reasoning...
The Race Condition
The bug lives in startPeriodicPing() at packages/core/src/shared/protocol.ts around lines 778-804. The setInterval callback is async, so once it fires and awaits _requestWithSchema(), JavaScript cannot cancel it. Clearing the interval only prevents future firings; the already-started async function continues running. This is the core of the race.
Exact Code Path
- The setInterval fires. The async callback begins executing and calls _requestWithSchema({ method: 'ping' }, ...), which registers a response handler in _responseHandlers and suspends at the await.
- Before the server responds, the caller invokes close().
- close() calls stopPeriodicPing(), which does clearInterval(this._pingTimer). The in-flight async callback has already started; clearing the interval only prevents future firings.
- close() then calls this._transport?.close(), which triggers the transport onclose callback.
- That callback invokes _onclose(), which iterates over all entries in _responseHandlers and rejects each one with new SdkError(SdkErrorCode.ConnectionClosed, 'Connection closed').
- The pending _requestWithSchema promise now rejects with the ConnectionClosed error.
- The catch block in startPeriodicPing executes: since SdkError extends Error, this._onerror(error) fires.
- The user's onerror callback receives a SdkError(ConnectionClosed) during a normal, user-initiated shutdown.
Why Existing Code Does Not Prevent It
The close() method does call stopPeriodicPing() before this._transport?.close(), which looks like it should protect against this. But stopPeriodicPing() only prevents new timer firings. It cannot retroactively cancel an async function that has already started and is awaiting a promise. There is no mechanism in the ping catch block to distinguish intentional close from a genuine connection failure.
Impact
Any consumer that sets onerror to monitor connection health (the primary intended use case per the PR description) will see false positive ConnectionClosed errors every time close() is called while a ping happens to be in-flight. Since pings fire on a regular interval, the race is probabilistic but reliably reproducible at scale or in tests. More critically, these errors are indistinguishable from a real remote-side failure, so automated reconnection logic or alerting built on onerror will misfire.
How to Fix
Add a private _closing boolean field to Protocol, initialized to false. In close(), set this._closing = true before calling stopPeriodicPing() and this._transport?.close(). In the ping catch block, guard the onerror call:
} catch (error) {
if (!this._closing) {
this._onerror(error instanceof Error ? error : new Error('Periodic ping failed: ' + String(error)));
}
}Reset _closing to false in _onclose() after cleanup so that a reconnected protocol instance works correctly. The _closing flag cleanly suppresses the error regardless of what exact error code the transport produces on shutdown.
Step-by-Step Proof
T=0 setInterval fires; pingCallback() begins executing
T=0 _requestWithSchema registers handler H in _responseHandlers, yields at await
T=1 close() is called by user
T=1 stopPeriodicPing() clears the timer -- in-flight pingCallback is unaffected
T=1 this._transport.close() called
T=1 transport triggers onclose -> _onclose() runs
T=1 _onclose(): for each handler in responseHandlers -> handler(SdkError(ConnectionClosed))
T=1 Handler H rejects; pingCallback resumes in catch block
T=1 catch(error): this._onerror(error) -- SPURIOUS ERROR DELIVERED TO USER
T=1 user onerror fires with 'Connection closed' during intentional shutdown
| this._pingTimer = setInterval(async () => { | ||
| try { | ||
| await this._requestWithSchema({ method: 'ping' }, EmptyResultSchema, { | ||
| timeout: this._pingIntervalMs | ||
| }); | ||
| } catch (error) { | ||
| this._onerror(error instanceof Error ? error : new Error(`Periodic ping failed: ${String(error)}`)); | ||
| } | ||
| }, this._pingIntervalMs); |
There was a problem hiding this comment.
🟡 The setInterval-based ping loop can briefly have 2 concurrent pings in-flight simultaneously. Since the ping timeout is also set to _pingIntervalMs, the timeout for ping N and the start of ping N+1 are both scheduled for the same wall-clock time — their ordering in the JS event loop is non-deterministic. Using setTimeout with rescheduling after each ping completes would strictly enforce the documented one-at-a-time behavior.
Extended reasoning...
What the bug is
In startPeriodicPing() (protocol.ts:772-780), setInterval fires every _pingIntervalMs regardless of whether the previous async callback has finished. The ping timeout passed to _requestWithSchema is also this._pingIntervalMs. This means the internal setTimeout that will reject ping N with a timeout error and the next setInterval tick that fires ping N+1 are both scheduled for the same wall-clock time (T + pingIntervalMs from ping N start). Which fires first is non-deterministic: it depends on the JS engine timer queue ordering, which is not specified.
The specific code path
The relevant code:
this._pingTimer = setInterval(async () => {
try {
await this._requestWithSchema({ method: 'ping' }, EmptyResultSchema, {
timeout: this._pingIntervalMs // same value as the interval
});
} catch (error) {
this._onerror(...);
}
}, this._pingIntervalMs);
Because setInterval does not await the async callback, it fires the next tick at T + pingIntervalMs unconditionally. The request inside the callback has its own setTimeout-based timeout also set to pingIntervalMs. Both macrotasks are enqueued for approximately the same wall-clock time.
Why existing code does not prevent it
The idempotency guard only prevents duplicate timers from multiple startPeriodicPing() calls. It does nothing to serialize individual ping invocations. Once the interval is running, each tick creates a new promise/request that executes concurrently with any still-in-flight ping.
Addressing the refutation and impact
The refutation correctly notes that the overlap is bounded: because the timeout equals the interval, each lingering ping will be canceled at approximately the same time the next one starts. The worst case is 2 concurrent pings, not N. In practice, on a well-functioning connection, pings respond well within the interval and there is no overlap. Only when a ping is slow (response time approaching pingIntervalMs) does the race condition manifest, and even then it is transient and bounded.
However, the PR description explicitly claims: "Ping timeout equals the interval: if a ping takes longer than one interval to respond, it fails with a timeout error rather than stacking up." This is imprecise. Due to timer imprecision, the new ping can start before the old one is cleaned up, meaning they do briefly stack up (2 at once). The stated guarantee is not actually enforced by the implementation.
Step-by-step proof
- T=0: setInterval fires, ping Better validation of data coming over the wire #1 is sent. A setTimeout(reject, pingIntervalMs) is registered inside _requestWithSchema.
- T=pingIntervalMs: Both the interval tick (ping Race condition between transport start and server connection #2) and the timeout for ping Better validation of data coming over the wire #1 are queued as macrotasks. Two timers set for the same target time have implementation-defined ordering.
- If the interval fires first: ping Race condition between transport start and server connection #2 is sent and added to _responseHandlers while ping Better validation of data coming over the wire #1's handler is still present. Both are in-flight simultaneously for a brief moment before ping Better validation of data coming over the wire #1's timeout fires.
- If the timeout fires first: ping Better validation of data coming over the wire #1 is cleaned up, then ping Race condition between transport start and server connection #2 is sent — matching the documented behavior.
How to fix it
Replace setInterval with a self-rescheduling setTimeout pattern that only schedules the next ping after the current one completes (success or failure). This strictly serializes pings and matches the documented one-at-a-time intent, at the cost of the interval becoming pingIntervalMs + RTT rather than exactly pingIntervalMs — an acceptable trade-off for health monitoring.
| /** | ||
| * Starts sending periodic ping requests at the configured interval. | ||
| * Pings are used to verify that the remote side is still responsive. | ||
| * Failures are reported via the {@linkcode onerror} callback but do not | ||
| * stop the timer; pings continue until the connection is closed. | ||
| * | ||
| * This is called automatically at the end of {@linkcode connect} when | ||
| * `pingIntervalMs` is set. Subclasses that override `connect()` and | ||
| * perform additional initialization (e.g., the MCP handshake) may call | ||
| * this method after their initialization is complete instead. | ||
| * | ||
| * Has no effect if periodic ping is already running or if no interval | ||
| * is configured. | ||
| */ |
There was a problem hiding this comment.
🟡 The JSDoc for startPeriodicPing() claims "This is called automatically at the end of connect() when pingIntervalMs is set," but Protocol.connect() ends with await this._transport.start() and never calls startPeriodicPing(). Any custom Protocol subclass that does not override connect() will silently never start periodic pings — update the JSDoc to accurately describe that only Client and Server call this after their respective initialization steps.
Extended reasoning...
What the bug is
The JSDoc comment on startPeriodicPing() (protocol.ts lines 759-762) contains a false statement: "This is called automatically at the end of connect() when pingIntervalMs is set." This implies the base Protocol.connect() is responsible for starting the ping timer when the option is configured.
The specific code path
Looking at Protocol.connect(), the method sets up transport callbacks and ends with await this._transport.start(). There is no call to startPeriodicPing() anywhere in the base implementation. The actual callers are: (1) Client.connect() calls this.startPeriodicPing() after the full MCP initialization handshake completes (and also on the reconnection fast-path); (2) Server._handleInitialized() calls this.startPeriodicPing() when the notifications/initialized notification is received.
Why existing code does not prevent the confusion
The second sentence of the JSDoc partially acknowledges the real behavior: "Subclasses that override connect() and perform additional initialization may call this method after their initialization is complete instead." However, this reads as an opt-out from the stated automatic behavior, not a correction of it. The false first sentence still stands.
Impact
Any developer implementing a custom Protocol subclass who passes pingIntervalMs and does not override connect() will get no periodic pings, despite the documentation promising automatic invocation. The ping timer simply never starts. This is a silent failure — no error is thrown, and _pingIntervalMs is set, so there is no obvious indication that something is wrong.
Step-by-step proof
- Create a class extending
Protocolwithout overridingconnect() - Instantiate with
new MyProtocol({ pingIntervalMs: 30_000 }) - Call
await myProtocol.connect(transport)— this runsProtocol.connect(), which callstransport.start()and returns startPeriodicPing()is never invoked;_pingTimerremainsundefined- No pings are ever sent, despite
_pingIntervalMsbeing set to 30,000
Fix
Update the JSDoc to accurately state that Client and Server are the actual callers, and that custom subclasses must call this method explicitly after their own initialization is complete.
Regarding the duplicate refutation
One verifier flagged this as a duplicate of bug_004. Whether or not another report covers the same issue, the bug is independently confirmed and real. The JSDoc is factually wrong regardless of duplication — the duplication concern is a process question, not a technical refutation.
Summary
pingIntervalMsoption toProtocolOptionsthat enables automatic periodic pings to verify the remote side is still responsiveinitializednotificationonerrorwithout stopping the timerunref()so it does not prevent clean Node.js process exitPer the MCP specification: "Implementations SHOULD periodically issue pings to detect connection health" with "configurable frequency."
Usage
Key design decisions
pingIntervalMsto enableunref()on the timer: the periodic ping does not keep the Node.js process alive if all other work is doneRelationship to other PRs
This complements PR #1712 (host process watchdog for stdio orphan detection). That PR detects when the host process is gone at the OS level; this PR detects when the remote MCP peer is unresponsive at the protocol level.
Fixes #1000
Test plan
should not send periodic pings when pingIntervalMs is not set(default behavior unchanged)should send periodic pings when pingIntervalMs is set and startPeriodicPing is calledshould stop periodic pings on closeshould report ping errors via onerror without stopping the timershould not start duplicate timers if startPeriodicPing is called multiple timesshould stop periodic pings when transport closes unexpectedlyChangeset
@modelcontextprotocol/core: minor,@modelcontextprotocol/client: minor,@modelcontextprotocol/server: minorCo-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com